findere: Fast and Precise Approximate Membership Query

Robidou, Lucas; Peterlongo, Pierre

doi:10.1007/978-3-030-86692-1_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12944)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Motivation: Approximate membership query (AMQ) structures such as Cuckoo filters or Bloom filters are widely used for representing large sets of elements. Their lightweight space usage explains their success, mainly because they are the only way to scale to hundreds of billions or trillions of elements. However, they suffer by nature from unavoidable false-positive calls that bias downstream analyses of methods using these data structures.

Results: In this work, we propose a simple strategy, and its implementation, for reducing the false-positive rate of any AMQ data structure indexing \(k\)-mers (words of length \(k\)). The proposed method, called findere, speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls, and with no memory overhead.

This method yields so-called “construction false positives”, but the number of such false positives is negligible when the method is used within classical parameter ranges. This method, as simple as it is effective, reduces either the false-positive rate or the space required to represent a set given a user-defined false-positive rate.
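
The abstract does not describe the mechanism itself. As an illustration only, the following Python sketch assumes one plausible reading of the strategy: the AMQ stores shorter words (s-mers, with s = k - z), and a k-mer is reported present only if all of its z + 1 overlapping s-mers are found. The names `BloomFilter`, `index_smers`, `query_kmers` and the parameter `z` are hypothetical choices for this sketch and are not taken from the findere code base. Under this assumption, a "construction false positive" would be a k-mer that was never indexed but whose s-mers were all indexed from other words.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter, used here only as a stand-in AMQ (assumption)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive n_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))


def index_smers(sequence, s, amq):
    """Insert every s-mer of `sequence` into `amq` (anything with .add)."""
    for i in range(len(sequence) - s + 1):
        amq.add(sequence[i:i + s])


def query_kmers(query, k, z, amq):
    """Return one boolean per k-mer of `query`.

    A k-mer is reported present only if its z + 1 consecutive s-mers
    (s = k - z) are all reported present by `amq`.
    """
    s = k - z
    if len(query) < k:
        return []
    result = []
    run = 0  # length of the current run of consecutive positive s-mers
    for i in range(len(query) - s + 1):
        run = run + 1 if query[i:i + s] in amq else 0
        if i >= z:  # the k-mer starting at i - z ends with this s-mer
            result.append(run >= z + 1)
    return result


if __name__ == "__main__":
    # An exact set stands in for the AMQ to isolate construction false positives.
    exact = set()
    index_smers("AACGT", 4, exact)
    index_smers("CCGTC", 4, exact)
    print(query_kmers("AACGT", 5, 1, exact))  # [True]: indexed 5-mer
    print(query_kmers("ACGTC", 5, 1, exact))  # [True]: never indexed, but both 4-mers were
    print(query_kmers("ACGTA", 5, 1, exact))  # [False]: 4-mer CGTA is absent

    bloom = BloomFilter(n_bits=1 << 16, n_hashes=3)
    index_smers("ACGTACGTTGCA", 4, bloom)
    print(query_kmers("GTACGTTG", 6, 2, bloom))  # indexed 6-mers are always reported
```

In this sketch, a lone false-positive s-mer from the filter is not enough to report a k-mer, because z + 1 consecutive s-mers must all be positive; that is the intuition for a lower false-positive rate. Indexed k-mers are always reported, since each of their s-mers was inserted and a Bloom filter never answers negatively for an inserted item.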

Availability: https://github.com/lrobidou/findere.

Acknowledgements

This work used HPC resources from the GenOuest bioinformatics core facility (https://www.genouest.org). The work was funded by ANR SeqDigger (ANR-19-CE45-0008).

Author information

Authors and Affiliations

Univ. Rennes, Inria, CNRS, IRISA, Rennes, France
Lucas Robidou & Pierre Peterlongo

Authors

Lucas Robidou
Pierre Peterlongo

Corresponding authors

Correspondence to Lucas Robidou or Pierre Peterlongo.

Editor information

Editors and Affiliations

Université de Rouen Normandie, Mont-St-Aignan, France
Thierry Lecroq
CNRS, CRIStAL, Villeneuve d'Ascq, France
Hélène Touzet

Electronic supplementary material

Supplementary material 1 (PDF, 129 KB)

About this paper

Cite this paper

Robidou, L., Peterlongo, P. (2021). findere: Fast and Precise Approximate Membership Query. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_13

DOI: https://doi.org/10.1007/978-3-030-86692-1_13
Published: 27 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86691-4
Online ISBN: 978-3-030-86692-1
eBook Packages: Computer Science, Computer Science (R0)
